SWE-bench Verified

A human-validated benchmark of 500 real-world GitHub issues testing whether AI can autonomously resolve software engineering problems

Published: September 13, 2025

Keywords: SWE-bench Verified, software engineering benchmark, AI coding evaluation, GitHub issue resolution, autonomous coding agents, Princeton NLP, OpenAI Preparedness, patch generation, real-world bugs, LLM leaderboard, mini-SWE-agent, coding agent benchmark

Introduction

Writing code that passes toy programming puzzles is one thing. Fixing real bugs in production-grade open-source repositories — where you must navigate thousands of files, understand project conventions, and produce a patch that passes hidden tests — is an entirely different challenge.

SWE-bench Verified is a human-validated benchmark of 500 real-world software engineering problems drawn from GitHub issues across 12 popular Python repositories. Each task gives an AI agent access to a full codebase and an issue description, then asks it to generate a patch that resolves the problem. The patch is evaluated against hidden unit tests that the agent never sees.

“SWE-bench Verified is a subset of the original test set from SWE-bench, consisting of 500 samples verified to be non-problematic by our human annotators. This version supersedes the original SWE-bench and SWE-bench Lite test sets.” — OpenAI, Introducing SWE-bench Verified

```mermaid
graph LR
    A["Original SWE-bench<br/>2,294 tasks<br/>Some infeasible"] --> B["Human Annotation<br/>93 software developers<br/>screened 1,699 samples"]
    B --> C["SWE-bench Verified<br/>500 validated tasks<br/>Clear & solvable"]
    C --> D["Reliable measure<br/>of AI coding<br/>capability"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```

What Is SWE-bench Verified?

The original SWE-bench (ICLR 2024 Oral) was a breakthrough: 2,294 software engineering problems sourced from real GitHub issues and pull requests across 12 Python repositories (Django, scikit-learn, sympy, matplotlib, Flask, etc.). But the original benchmark had issues — 68.3% of samples were flagged by human reviewers for underspecified problem statements, unfair unit tests, or setup problems that could reject correct solutions.

SWE-bench Verified fixes this. OpenAI’s Preparedness team collaborated with the SWE-bench authors to have 93 professional software developers manually review each sample. Only tasks with clear problem descriptions, fair evaluation tests, and reliable environments were kept — resulting in a curated set of 500 high-quality tasks.

How It Works

| Step | Description |
|---|---|
| Input | Agent receives a full codebase + GitHub issue description |
| Task | Generate a patch (code edit) that resolves the described issue |
| Evaluation | Hidden FAIL_TO_PASS tests must pass (issue resolved) AND PASS_TO_PASS tests must pass (nothing broken) |
| Metric | % Resolved — percentage of the 500 tasks fully solved |
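In code, the resolution criterion is a simple conjunction over both test sets. A minimal sketch (the function name and the results format are illustrative, not the evaluation harness's actual API):

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """A task counts as resolved only if every FAIL_TO_PASS test
    now passes AND every PASS_TO_PASS test still passes."""
    issue_fixed = all(test_results.get(t, False) for t in fail_to_pass)
    nothing_broken = all(test_results.get(t, False) for t in pass_to_pass)
    return issue_fixed and nothing_broken

# A patch that fixes the bug but breaks an existing test is NOT resolved:
results = {"test_bugfix": True, "test_existing": False}
print(is_resolved(results, ["test_bugfix"], ["test_existing"]))  # False
```

A single failing test in either set means the whole task scores zero; there is no partial credit.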

```mermaid
graph TD
    A["GitHub Issue<br/>Description"] --> B["AI Agent"]
    C["Full Codebase<br/>(thousands of files)"] --> B
    B --> D["Generated Patch<br/>(code changes)"]
    D --> E{"Hidden Tests"}
    E -->|"FAIL_TO_PASS ✅<br/>PASS_TO_PASS ✅"| F["✅ Resolved"]
    E -->|"Any test fails"| G["❌ Not Resolved"]

    style A fill:#3498db,stroke:#333,color:#fff
    style C fill:#3498db,stroke:#333,color:#fff
    style B fill:#9b59b6,stroke:#333,color:#fff
    style D fill:#f39c12,stroke:#333,color:#fff
    style E fill:#2c3e50,stroke:#333,color:#fff
    style F fill:#27ae60,stroke:#333,color:#fff
    style G fill:#e74c3c,stroke:#333,color:#fff
```

Who Built It?

SWE-bench was originally created at Princeton NLP by:

  • Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

The Verified subset was produced in collaboration with OpenAI’s Preparedness team:

  • Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, and others, with senior leads Tejal Patwardhan, Kevin Liu, and Aleksander Mądry

Publication & Timeline

| Date | Milestone |
|---|---|
| October 2023 | SWE-bench paper published (arXiv:2310.06770) |
| January 2024 | Accepted as ICLR 2024 Oral |
| June 2024 | Docker-ized evaluation harness released |
| August 2024 | SWE-bench Verified released — 500 human-validated tasks |
| October 2024 | SWE-bench Multimodal released (ICLR 2025) |
| July 2025 | mini-SWE-agent scores 65% on Verified in 100 lines of Python |

What Skills Does It Test?

SWE-bench Verified evaluates the end-to-end software engineering capabilities of AI systems — from understanding a bug report to producing a working fix.

```mermaid
graph TD
    A["SWE-bench Verified<br/>500 Real-World Tasks"] --> B["Code Understanding<br/>Navigate large codebases<br/>across multiple files"]
    A --> C["Bug Diagnosis<br/>Interpret issue descriptions<br/>and reproduce bugs"]
    A --> D["Patch Generation<br/>Edit code to fix issues<br/>without breaking anything"]
    A --> E["Multi-File Reasoning<br/>Coordinate changes across<br/>functions, classes, files"]
    A --> F["Testing Awareness<br/>Produce fixes that pass<br/>hidden unit tests"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#3498db,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#f39c12,stroke:#333,color:#fff
    style E fill:#9b59b6,stroke:#333,color:#fff
    style F fill:#2c3e50,stroke:#333,color:#fff
```

| Capability | What Is Tested |
|---|---|
| Codebase Navigation | Finding relevant files and functions in repositories with thousands of files |
| Issue Comprehension | Understanding natural-language bug reports, feature requests, and ambiguous problem descriptions |
| Code Generation | Writing correct patches that resolve issues — not just generating new code from scratch |
| Regression Safety | Ensuring fixes don’t break existing functionality (PASS_TO_PASS tests) |
| Tool Use | Interacting with execution environments, running commands, reading outputs |
| Long-Context Reasoning | Processing extremely long contexts spanning multiple files and directories |

The 12 Python Repositories

Tasks are drawn from real issues in: Django, scikit-learn, sympy, matplotlib, Flask, requests, pytest, astropy, sphinx, xarray, pylint, and seaborn.

Human-Verified Quality

The annotation campaign assessed each task on three dimensions:

  • Problem Statement Clarity (scale 0–3): Is the issue well-specified? Can the agent understand what to fix?
  • Test Validity (scale 0–3): Do the FAIL_TO_PASS tests fairly evaluate solutions? Or do they reject valid fixes?
  • Difficulty (estimated developer time): <15 min, 15 min–1 hr, 1–4 hr, >4 hr

Any sample flagged at severity 2 or higher by even a single one of its three annotators was removed, a conservative criterion that yields high confidence in the remaining 500 tasks.
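That filter can be sketched as follows (the record layout here is hypothetical; the published annotation data uses its own schema):

```python
def keep_sample(annotations: list[dict]) -> bool:
    """Keep a sample only if NO annotator scored either dimension
    (problem clarity, test validity) at severity 2 or higher (scale 0-3)."""
    return all(a["clarity_severity"] < 2 and a["test_severity"] < 2
               for a in annotations)

# Three clean annotations keep the sample; one severity-2 flag removes it.
clean = [{"clarity_severity": 0, "test_severity": 1}] * 3
flagged = clean[:2] + [{"clarity_severity": 2, "test_severity": 0}]
print(keep_sample(clean), keep_sample(flagged))  # True False
```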

Dashboard — SWE-bench Verified Leaderboard

Bash-Only Model Evaluation (mini-SWE-agent)

To enable fair apples-to-apples comparison of language models, all models below are evaluated using mini-SWE-agent — a minimal 100-line bash-only agent loop with no special tools, RAG, or scaffolding. These results reflect raw LM capability when given just a bash shell and a problem.
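The core loop behind such an agent can be sketched in a few lines of Python. This is an illustrative reconstruction, not mini-SWE-agent's actual code; the submit marker and the `model` callable interface are assumptions made for the example:

```python
import subprocess

SUBMIT = "SUBMIT_PATCH"  # hypothetical end-of-episode marker

def agent_loop(model, issue: str, max_steps: int = 20) -> list[str]:
    """Feed the issue to the model, run each bash command it emits,
    and append the output to the transcript until it submits."""
    transcript = [f"Issue: {issue}"]
    for _ in range(max_steps):
        action = model("\n".join(transcript))
        if action.strip() == SUBMIT:
            break
        out = subprocess.run(action, shell=True, capture_output=True,
                             text=True, timeout=60)
        transcript.append(f"$ {action}\n{out.stdout}{out.stderr}")
    return transcript
```

A real run would plug an LLM API call in as `model` and pass the GitHub issue text as `issue`; the point of the bash-only setup is that nothing beyond this loop mediates between the model and the repository.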

| Rank | Model | % Resolved | Cost per Instance |
|---|---|---|---|
| 1 | Claude 4.5 Opus (high reasoning) | 76.80 | $0.75 |
| 2 | Gemini 3 Flash (high reasoning) | 75.80 | $0.36 |
| 2 | MiniMax M2.5 (high reasoning) | 75.80 | $0.07 |
| 4 | Claude Opus 4.6 | 75.60 | $0.55 |
| 5 | GPT-5.2 Codex | 72.80 | $0.45 |
| 5 | GLM-5 (high reasoning) | 72.80 | $0.53 |
| 5 | GPT-5.2 (high reasoning) | 72.80 | $0.47 |
| 8 | Claude 4.5 Sonnet (high reasoning) | 71.40 | $0.66 |
| 9 | Kimi K2.5 (high reasoning) | 70.80 | $0.15 |
| 10 | DeepSeek V3.2 (high reasoning) | 70.00 | $0.45 |
| 11 | Gemini 3 Pro | 69.60 | $0.96 |
| 12 | Claude 4.5 Haiku (high reasoning) | 66.60 | $0.33 |
| 13 | GPT-5 Mini | 56.20 | $0.05 |

Source: swebench.com — Verified leaderboard, mini-SWE-agent evaluation (consulted March 29, 2026)

Key Insights from the Results

```mermaid
graph TD
    A["SWE-bench Verified<br/>Leaderboard Insights"] --> B["Top Tier: ~75%<br/>Claude 4.5 Opus<br/>Gemini 3 Flash<br/>MiniMax M2.5"]
    A --> C["Cost Efficiency<br/>MiniMax M2.5: 75.8%<br/>at only $0.07/task"]
    A --> D["Gap to 100%<br/>~25% of tasks<br/>remain unsolved"]
    A --> E["Model Size Matters<br/>GPT-5 Mini: 56.2%<br/>vs GPT-5.2: 72.8%"]

    style A fill:#2c3e50,stroke:#333,color:#fff
    style B fill:#27ae60,stroke:#333,color:#fff
    style C fill:#3498db,stroke:#333,color:#fff
    style D fill:#e74c3c,stroke:#333,color:#fff
    style E fill:#f39c12,stroke:#333,color:#fff
```

  1. The top tier reaches ~76% resolved — remarkable progress from Claude 2’s original 1.96% on SWE-bench in 2023
  2. Cost varies dramatically — MiniMax M2.5 achieves 75.8% at just $0.07/task, while Gemini 3 Pro costs $0.96 for 69.6%
  3. ~25% of tasks remain unsolved even by the best models — the hardest real-world bugs still defeat frontier AI
  4. Reasoning modes help significantly — models with “high reasoning” consistently outperform their default configurations
  5. Smaller models lag behind — GPT-5 Mini at 56.2% vs GPT-5.2 at 72.8% shows the importance of model scale for complex SWE tasks
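The cost spread (insight 2) is easy to quantify: dividing % Resolved by cost per instance gives a rough tasks-resolved-per-dollar figure. This is a back-of-envelope comparison using the figures from the table above, not an official leaderboard metric:

```python
# (% resolved, $ per instance) copied from the leaderboard table above
leaderboard = {
    "MiniMax M2.5": (75.8, 0.07),
    "Gemini 3 Pro": (69.6, 0.96),
    "GPT-5 Mini":   (56.2, 0.05),
}

def resolved_per_dollar(pct: float, cost: float) -> float:
    # Over 500 tasks: (500 * pct/100) resolved / (500 * cost) spent,
    # so the 500s cancel out.
    return (pct / 100) / cost

for model, (pct, cost) in leaderboard.items():
    print(f"{model}: {resolved_per_dollar(pct, cost):.1f} tasks/$")
```

By this measure MiniMax M2.5 resolves over ten tasks per dollar, while Gemini 3 Pro resolves less than one, despite the two being only ~6 points apart in accuracy.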

Historical Context — The Rapid Rise

| Date | Best Score | Model/Agent |
|---|---|---|
| October 2023 | 1.96% | Claude 2 (original SWE-bench paper) |
| March 2024 | 12.47% | SWE-agent |
| August 2024 | 33.2% | GPT-4o + Agentless (on Verified) |
| July 2025 | 65.0% | mini-SWE-agent |
| February 2026 | 76.8% | Claude 4.5 Opus (high reasoning) |

The improvement from 1.96% → 76.8% in just over two years represents one of the fastest capability gains in any AI benchmark.

From SWE-bench to SWE-bench Verified — Why the Upgrade?

The original SWE-bench systematically underestimated model capabilities because:

```mermaid
graph LR
    A["Problem 1<br/>Underspecified issues<br/>38.3% flagged"] --> D["SWE-bench Verified<br/>Filters out all<br/>problematic samples"]
    B["Problem 2<br/>Unfair unit tests<br/>61.1% flagged"] --> D
    C["Problem 3<br/>Setup failures<br/>causing false negatives"] --> D
    D --> E["500 high-quality<br/>validated tasks"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#e74c3c,stroke:#333,color:#fff
    style C fill:#e74c3c,stroke:#333,color:#fff
    style D fill:#27ae60,stroke:#333,color:#fff
    style E fill:#3498db,stroke:#333,color:#fff
```

  • 38.3% of samples had underspecified problem statements — ambiguous issues that even human developers would struggle with
  • 61.1% had unit tests that could unfairly reject valid solutions — e.g., requiring exact string matches on deprecation messages not mentioned in the issue
  • Overall, 68.3% of original samples were filtered out during the verification process

This doesn’t make SWE-bench Verified “easier” — it makes it fairer. Performance increases within individual difficulty categories confirm that the improvement comes from removing impossible tasks, not just easy ones.

Where to Explore

| Resource | Link |
|---|---|
| Verified Leaderboard | swebench.com/verified.html |
| Full Leaderboard (all agents) | swebench.com |
| Results Viewer | swebench.com/viewer.html |
| HuggingFace Dataset | huggingface.co/datasets/princeton-nlp/SWE-bench_Verified |
| GitHub Repository | github.com/SWE-bench/SWE-bench |
| arXiv Paper | arxiv.org/abs/2310.06770 |
| OpenAI Blog (Verified) | openai.com/index/introducing-swe-bench-verified |
| mini-SWE-agent | github.com/SWE-agent/mini-swe-agent |
| License | MIT License |

Load the Dataset

```python
# Requires the Hugging Face `datasets` library: pip install datasets
from datasets import load_dataset

swebench_verified = load_dataset(
    "princeton-nlp/SWE-bench_Verified", split="test"
)
```

Watch the Video

Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀

References

  1. Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770.
  2. OpenAI Preparedness. (2024). Introducing SWE-bench Verified. openai.com/index/introducing-swe-bench-verified.
  3. Yang, J., Jimenez, C.E., Zhang, A.L., et al. (2025). SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? ICLR 2025. arXiv:2410.03859.
  4. SWE-bench Team. SWE-bench Official Leaderboard. swebench.com.

Read More

  • LiveCodeBench Pro — competitive programming benchmark with contamination-free evaluation
  • SimpleQA — measuring short-form factuality and hallucination in LLMs
  • Humanity’s Last Exam — the hardest AI benchmark across 100+ academic disciplines
  • GPQA Diamond — graduate-level science questions for expert reasoning
  • ARC-AGI-2 — abstract reasoning that challenges pattern recognition beyond training data